Skip to content

Conversation

@jfactory07
Copy link
Contributor

@jfactory07 jfactory07 commented Jan 7, 2026

Motivation

  • Enable new address modes: Add support for B address interleave and K-alignment (KRingShift) in TensileLite codegen.

  • Correctness in tail paths: Ensure tail global reads behave correctly when KRS is disabled at runtime (sgprKRingShift==0) by falling back to the original load-only behavior.

  • Performance: Reduce redundant tail offset work by hoisting invariants and interleaving offset-apply with buffer_load in the tail path.

Technical Details

  • BAddrInterleave (e77420a)

    • Adds BAddrInterleave validation/knob and ISA capability wiring.
    • Computes runtime G once and reuses it across SRD/address calculations (kept live in SGPRs).
  • KRingShift align-k (cec7d49)

    • Adds KRingShift knob and per-workgroup initialization of sgprKRingShift based on cacheline constraints.
    • Applies KRS adjustment to computed global addresses and introduces reference-style tail offset remap macros.
  • Tail-path refinements (5ef3eb7)

    • Moves KRS tail offset patching to just-in-time per-load emission (setup once; apply right before each load).
    • Adds runtime branching so sgprKRingShift==0 takes a no-KRS load-only path; otherwise executes the KRS-enabled interleaved path (including shared A/B label flow when applicable).
    • Fixes tail LDS “zero-out mask” control flow to be conditional (only skip when aligned) and skips the mask sequence when KRS is enabled (since KRS already forces safe OOB behavior).
    • Ensures SGPR cleanup: emits .set ... , UNDEF for KRS/BInterleaveG after last use to avoid accidental remapping.
  • Macro/rocisa robustness (448c4d6)

    • Converts KRS tail offset macros to rocisa.code.Macro API.
    • Fixes RegisterContainer::toString() handling for macro arg register ranges to prevent invalid expansions.
    • Forces specific literals to print as intended (e.g., 0xffffffff).

Test Plan

  • Codegen build: Run Tensile library generation for gfx950 (e.g., asm-debug + keep-build-tmp) and confirm codegen completes without asm errors.
  • Assembly inspection:
    • Confirm KRS markers/macros appear as expected and macro expansions are valid.
    • Validate tail-path behavior:
      • sgprKRingShift==0 → load-only path (no KRS offset apply)
      • sgprKRingShift!=0 → KRS-enabled interleaved path
  • Runtime validation:
    • Run hipblaslt-bench with cases that toggle KRS enable/disable (shapes where cacheline congruence permits/disables KRS) and compare correctness vs baseline.

Test Result

benchmark test for 2048x3072x1880 TN: 9% uplift

Submission Checklist

# K ring-shift (restricted): apply a per-WG shift along the summation (K) dimension so that
# the B-side base K address for each workgroup is cacheline-aligned/congruent, while preserving
# correctness via full-loop ring wrap. Intended for TN/NN-like B (TLUB == False).
"KRingShift": [False, True],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you add the default value of these two new parameters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, their default values are currently set to false.

 - require tiles1 = SizeJ / MT1 to be an integer (SizeJ % MT1 == 0)
    #   - require lowbit(tiles1) > 1 so that G=min(lowbit(tiles1), LVCB) is > 1 (enabled)
    # Note: if lowbit(tiles1) == 1, then G==1 and the kernel disables BAddrInterleave.
@jfactory07 jfactory07 marked this pull request as ready for review January 23, 2026 06:06
@jfactory07 jfactory07 requested a review from a team as a code owner January 23, 2026 06:06
@jfactory07 jfactory07 changed the title codegen: BAddrInterleave codegen: handle unaligned K for TN Jan 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants